In [196]:
%%capture
%run "5 - Statistics.ipynb"
%run "8 - Gradient Descent.ipynb"
import matplotlib.pyplot as plt
import random
%matplotlib inline
In [197]:
def predict(alpha, beta, x_i):
    return beta * x_i + alpha
What should we use for alpha and beta? Assuming we know the desired output y_i, we can calculate the error of our prediction for a given input:
In [198]:
def error(alpha, beta, x_i, y_i):
    return y_i - predict(alpha, beta, x_i)
Now we can calculate the errors across the entire data set:
In [199]:
def sum_of_squared_errors(alpha, beta, x, y):
    return sum(error(alpha, beta, x_i, y_i) ** 2 for x_i, y_i in zip(x, y))
Now we just need to find the alpha and beta that minimize the sum of squared errors:
In [200]:
def least_squares_fit(x, y):
    """given training values for x and y,
    find the least-squares values of alpha and beta"""
    beta = correlation(x, y) * standard_deviation(y) / standard_deviation(x)
    alpha = mean(y) - beta * mean(x)
    return alpha, beta
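Setting the partial derivatives of the sum of squared errors with respect to alpha and beta to zero and solving gives the closed-form solution that least_squares_fit uses:

$$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \operatorname{corr}(x, y)\,\frac{\sigma_y}{\sigma_x}, \qquad \alpha = \bar{y} - \beta\,\bar{x}$$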
In [201]:
alpha, beta = least_squares_fit(num_friends_clean, daily_minutes_clean)
alpha, beta
Out[201]:
In [202]:
predict(alpha, beta, 20)
Out[202]:
In [203]:
plt.title('Simple Linear Regression Model');
plt.ylabel('minutes per day');
plt.xlabel('# of friends')
plt.scatter(num_friends_clean, daily_minutes_clean);
plt.plot(range(0, 50), [predict(alpha, beta, x) for x in range(0, 50)], color='green');
Our model is pretty good for how simple it is! We can measure how well the model fits using the coefficient of determination (also known as R-squared), which measures the fraction of the total variation in the dependent variable that is captured by the model.
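In symbols, with predictions $\hat{y}_i = \beta x_i + \alpha$:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$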
In [204]:
def total_sum_of_squares(y):
    """the total squared variation of y_i's from their mean"""
    return sum(v ** 2 for v in de_mean(y))

def r_squared(alpha, beta, x, y):
    """the fraction of variation in y captured by the model, which equals
    1 - the fraction of variation in y not captured by the model"""
    return 1.0 - (sum_of_squared_errors(alpha, beta, x, y) / total_sum_of_squares(y))
r_squared(alpha, beta, num_friends_clean, daily_minutes_clean) # 0.329
Out[204]:
Higher R-squared values indicate a better-fitting model; the maximum possible value is 1.
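We can also estimate alpha and beta with stochastic gradient descent. For a single point the squared error is $(y_i - (\beta x_i + \alpha))^2$, and its partial derivatives, which the gradient function below returns, are:

$$\frac{\partial}{\partial \alpha}\,(y_i - \beta x_i - \alpha)^2 = -2\,(y_i - \beta x_i - \alpha), \qquad \frac{\partial}{\partial \beta}\,(y_i - \beta x_i - \alpha)^2 = -2\,(y_i - \beta x_i - \alpha)\,x_i$$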
In [205]:
def squared_error(x_i, y_i, theta):
    alpha, beta = theta
    return error(alpha, beta, x_i, y_i) ** 2

def squared_error_gradient(x_i, y_i, theta):
    alpha, beta = theta
    return [-2 * error(alpha, beta, x_i, y_i),        # alpha partial derivative
            -2 * error(alpha, beta, x_i, y_i) * x_i]  # beta partial derivative

# choose random values to start
random.seed(0)
theta = [random.random(), random.random()]
alpha, beta = minimize_stochastic(squared_error,
                                  squared_error_gradient,
                                  num_friends_clean,
                                  daily_minutes_clean,
                                  theta,
                                  0.0001)
alpha, beta
Out[205]:
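As a quick sanity check (a sketch using only the functions defined above; exact values depend on the random seed and step size), we can compare the gradient-descent estimates to the closed-form fit by computing R-squared for each:
In [ ]:
# compare the stochastic-gradient-descent fit to the closed-form least-squares fit
sgd_r2 = r_squared(alpha, beta, num_friends_clean, daily_minutes_clean)
ls_alpha, ls_beta = least_squares_fit(num_friends_clean, daily_minutes_clean)
ls_r2 = r_squared(ls_alpha, ls_beta, num_friends_clean, daily_minutes_clean)
sgd_r2, ls_r2  # the two values should be close if gradient descent converged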